HiCDiff: single-cell Hi-C data denoising with diffusion models

Abstract The genome-wide single-cell chromosome conformation capture technique, i.e. single-cell Hi-C (ScHi-C), was recently developed to interrogate the conformation of the genome of individual cells. However, single-cell Hi-C data are much sparser than bulk Hi-C data of a population of cells, and the noise in single-cell Hi-C data makes them difficult to apply and analyze in biological research. Here, we developed the first generative diffusion models (HiCDiff) to denoise single-cell Hi-C data in the form of chromosomal contact matrices. HiCDiff uses a deep residual network to remove the noise in the reverse process of diffusion and can be trained in both unsupervised and supervised learning modes. Benchmarked on several single-cell Hi-C test datasets, the diffusion models substantially remove the noise in single-cell Hi-C data. The unsupervised HiCDiff outperforms most supervised non-diffusion deep learning methods and achieves performance comparable to the state-of-the-art supervised deep learning method in terms of multiple metrics, demonstrating that diffusion models are a useful approach to denoising single-cell Hi-C data. Moreover, its good performance also holds when denoising bulk Hi-C data.


Introduction
The information about the three-dimensional (3D) conformation (structure) of a genome is important for analyzing and understanding its function, such as gene regulation, enhancer-promoter interaction and genome methylation. Hi-C is a widely used high-throughput next-generation sequencing assay for measuring pairwise contacts between any pair of genomic loci [1]. Hi-C chromosomal contact data have revealed important genome conformation features such as topologically associated domains (TADs) and chromatin compartments in both a population of (bulk) cells [2][3][4] and single cells [5][6][7].
However, Hi-C data usually contain substantial noise due to multiple factors. The most common one is that the amplification step for the library preparation in the experiments introduces a distance-dependent amplification bias, such that the noise-to-signal ratio increases with genomic distance. The restriction enzymes used to cut contacted DNA fragments from the genome also have a biased preference, leading to over- or under-representation of some contacts. It is important to remove the noise in Hi-C data, particularly in very sparse single-cell Hi-C data with a very low signal-to-noise ratio, in order to use them well in downstream applications [3,8,9,10].
Hi-C paired-end reads are usually mapped to a reference genome and then converted into chromosomal contact matrices/maps [3,11]. In a contact matrix (M), each entry M[i, j] contains the number of reads indicating the frequency of two fragments i and j being in contact. Several deep learning methods have been developed to impute and/or denoise Hi-C contact matrices, including HiCSR [12], DeepHiC [13], HiCPlus [14], VEHiCLE [15] and HiCARN [16] for processing bulk Hi-C data of a population of cells, ScHiCEDRN [17] for processing both single-cell and bulk Hi-C data, and Higashi [18] and DeepLoop [19] for processing single-cell Hi-C data. All these methods are supervised learning methods, which require both labeled data (e.g. cleaner data or noiseless data) and noisy input data for training. However, pairs of labeled and noisy data may not always be available, and a method trained on one kind of noisy and labeled data may not be applicable to another kind.
Inspired by both the Denoising Diffusion Probabilistic Models (DDPM) [20,21] and Denoising Diffusion Restoration Models (DDRM) [22], which have achieved success in denoising and generating images [23][24][25][26][27], we developed a diffusion model, HiCDiff, that can work in both supervised and unsupervised modes to denoise Hi-C data of both single cells and bulk cells. In both modes, HiCDiff uses a parameterized Markov chain model trained to learn the transition from noisy data to cleaner data, reversing a forward diffusion process that gradually adds Gaussian noise to Hi-C data.
The conventional diffusion models used to denoise data in other domains, including DDPM, DDRM and SR3 [28], all employed U-Net architectures with small modifications to parameterize the Markov chain model in the reverse diffusion process. In this work, we adapted a DDPM with U-Net as the baseline method for denoising Hi-C data. Moreover, we designed HiCDiff, which employs a residual network architecture with DDPM to denoise the Hi-C data of either a single cell or bulk cells. HiCDiff consistently performs better than DDPM on several single-cell and bulk Hi-C test datasets in terms of multiple evaluation metrics in both the unsupervised and supervised modes. Compared to the non-diffusion deep learning methods, HiCDiff outperforms most of them and is comparable to the state-of-the-art method.

Denoising diffusion framework and training algorithms of HiCDiff
HiCDiff uses the framework of the DDPM [20] to model how noise is added into data and how it can be removed. It is inspired by nonequilibrium thermodynamics and uses a Markov chain to model how uncorrupted data ($y_0$) gradually diffuse into noise ($y_T$) from time 0 to T in the forward process (q): $y_0 \to y_1 \to \cdots \to y_{T-1} \to y_T$, and how the noise ($y_T$) is gradually transformed back to the true/target data ($y_0$) in the inverse process (p). In the forward process, Gaussian noise is added to the data at time t − 1 to generate corrupted data at time t according to the distribution $q(y_t \mid y_{t-1}) := \mathcal{N}(y_t; \sqrt{1-\beta_t}\, y_{t-1}, \beta_t I)$, where $\beta_t$ is a variance schedule parameter at time t ($1 \le t \le T$) that controls how much Gaussian noise is added at each time step t. Because of the Markov property of the forward process, $y_t$ can be directly sampled from $y_0$ according to the distribution $q(y_t \mid y_0) = \mathcal{N}(y_t; \sqrt{\bar{\alpha}_t}\, y_0, (1-\bar{\alpha}_t) I)$, where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. The forward process can be carried out straightforwardly to generate corrupted data to train a model to reverse the data corruption process iteratively. It is worth noting that Gaussian noise is the most common noise used with diffusion models because of its nice mathematical properties, such as allowing $y_t$ to be expressed analytically as a Gaussian random variable, but it is not the only one. Other kinds of noise such as Poisson noise and exponential noise may be considered in the future.
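The closed-form distribution of $y_t$ given $y_0$ follows from unrolling the one-step transition. The short derivation below is a standard DDPM argument rather than something specific to HiCDiff; the standard Gaussian variables $\epsilon_t$, $\bar{\epsilon}$ and $\epsilon$ are symbols introduced here for illustration, with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t$ the running product of the $\alpha_s$.

```latex
\begin{align}
y_t &= \sqrt{\alpha_t}\, y_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, y_{t-2}
       + \sqrt{\alpha_t (1-\alpha_{t-1})}\,\epsilon_{t-1}
       + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, y_{t-2}
       + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon} \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
       \qquad \epsilon \sim \mathcal{N}(0, I).
\end{align}
```

The third line merges two independent zero-mean Gaussians, whose variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$.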
In the reverse (denoising) Markov chain process, $y_T \to y_{T-1} \to \cdots \to y_1 \to y_0$, the joint distribution of the data points, $p_\theta(y_{0:T})$, can be calculated by the formula $p_\theta(y_{0:T}) := p_\theta(y_T) \prod_{t=0}^{T-1} p_\theta(y_t \mid y_{t+1})$, where $p_\theta(y_T) = \mathcal{N}(y_T; 0, I)$ and $p_\theta(y_t \mid y_{t+1}) = \mathcal{N}(y_t; \mu_\theta(y_{t+1}, t+1), \sigma_\theta(y_{t+1}, t+1))$. $\mu_\theta$ and $\sigma_\theta$ are the mean and variance of $y_t$, which depend on $y_{t+1}$. The mean can be predicted from $y_{t+1}$ by a neural network-based generative model with parameter θ trained on the data generated in the forward process. In practice, because the mean is proportional to the difference between $y_{t+1}$ and the (scaled) added noise, and $y_{t+1}$ has been provided as input, a neural network ($f_\theta$) is usually trained to predict the Gaussian noise from $y_{t+1}$ instead. Once the noise is predicted, the mean can then be calculated. $\sigma_\theta$ can be set to the variance of time step t in the forward process. Then, $y_t$ can be sampled from the Gaussian distribution with this mean and variance.
To apply the probabilistic diffusion framework above to Hi-C chromosomal contact matrices, Algorithm 1 is used by HiCDiff to add Gaussian noise into a clean contact matrix ($y_0$) to generate a corrupted matrix ($y_t$) for any time t in the forward process. The corrupted contact matrices with the added noise are used to train a generative neural network to remove the noise in the reverse process.
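A single reverse step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard DDPM expression for the mean in terms of the predicted noise, takes the forward variance $\beta_t$ as the reverse variance, indexes the step as $y_t \to y_{t-1}$ (a shift of one relative to the text), and assumes T = 1000 with a linear schedule from 1e-4 to 0.02. The function `p_sample_step` and the callable `predict_noise` are hypothetical names standing in for the trained network $f_\theta$.

```python
import numpy as np

def p_sample_step(y_t, t, predict_noise, betas, alpha_bar, rng):
    """Draw y_{t-1} from y_t for a 1-based time step t (standard DDPM update)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    eps_hat = predict_noise(y_t, t)  # stand-in for the trained network f_theta
    # Mean of the reverse transition, computed from the predicted noise.
    mean = (y_t - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean  # no fresh noise is added at the final step
    z = rng.standard_normal(y_t.shape)
    return mean + np.sqrt(beta_t) * z  # assumed reverse variance: beta_t

rng = np.random.default_rng(0)
T = 1000                                # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # assumed linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)

# Sanity check with an oracle that returns the true noise: one forward step
# followed by one reverse step recovers y_0 exactly.
y0 = rng.uniform(-1.0, 1.0, size=(8, 8))
eps = rng.standard_normal(y0.shape)
y1 = np.sqrt(alpha_bar[0]) * y0 + np.sqrt(1.0 - alpha_bar[0]) * eps
recovered = p_sample_step(y1, 1, lambda y, t: eps, betas, alpha_bar, rng)
```

With a perfect noise predictor the mean at t = 1 reduces algebraically to $y_0$, which is why predicting the noise is sufficient to determine the mean.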
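The forward noising of a contact matrix at an arbitrary time step can be sketched as below. This is a hedged illustration of the closed-form sampling $q(y_t \mid y_0)$, not the paper's Algorithm 1 verbatim: the number of steps (T = 1000), the linear schedule endpoints (1e-4 and 0.02) and the function names `linear_beta_schedule` and `q_sample` are assumptions, and the input is a random placeholder standing in for a 64 × 64 contact sub-matrix already scaled to [−1, 1].

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(y0, t, alpha_bar, rng):
    """Sample y_t ~ q(y_t | y_0) for a 1-based time step t; also return the noise."""
    eps = rng.standard_normal(y0.shape)
    a = alpha_bar[t - 1]
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps
    return y_t, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)     # running product of alpha_s = 1 - beta_s

rng = np.random.default_rng(0)
y0 = rng.uniform(-1.0, 1.0, size=(64, 64))   # placeholder clean sub-matrix
y_early, _ = q_sample(y0, 1, alpha_bar, rng)  # almost identical to y0
y_T, _ = q_sample(y0, T, alpha_bar, rng)      # close to pure Gaussian noise
```

Returning the sampled noise alongside $y_t$ is convenient because the noise is the regression target when training the denoising network.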
Algorithm 1. Generate a noisy chromosomal contact matrix at time t in the forward process.
Input: $y_0$, the noiseless chromosomal contact matrix at the start time $t_0$; t, a time step in the range […]

The generative neural network model can be trained in either unsupervised learning mode (unconditional diffusion) or supervised learning mode (conditional diffusion). In the unsupervised mode [29], the data ($y_0$) are fully corrupted into noise ($y_T$) in the forward process (Figure 1A), and then a generative model is trained to predict the noise to be removed from $y_t$ to estimate the mean of $y_{t-1}$. Once the model is trained, it can be used in the inverse process to recover $y_0$ step by step, starting from the complete Gaussian noise $y_T$ [30]. The unsupervised mode is used when only presumably clean data ($y_0$) are provided for training. Algorithm 2 describes how a generative model ($f_\theta$) is trained in the unsupervised mode to predict the noise at any time step until it converges. In contrast, the supervised learning mode can be used when pairs of noisy data x and clean data $y_0$ (target label) are provided to train the generative model. In this mode (Figure 1B), a noisy data matrix x (i.e. a condition) in conjunction with the corrupted data ($y_t$) is used as input for a generative model to predict the noise to be removed to obtain the mean of $y_{t-1}$ in the inverse process. Algorithm 3 describes how a generative model ($f_\theta$) is trained in the supervised mode to predict the noise to be removed.
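The training objective shared by the two modes can be sketched as a single loss evaluation. This is a simplified illustration, not a reproduction of Algorithms 2 and 3: the function `noise_prediction_loss`, the placeholder predictor `zero_predictor` and the schedule values are assumptions, and no gradient step is shown. In the supervised case the condition x is stacked with $y_t$ along a leading channel axis, which mirrors the doubled input channels of the supervised network.

```python
import numpy as np

def noise_prediction_loss(f_theta, y0, alpha_bar, rng, x=None):
    """Mean squared error between added and predicted noise at one random step."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))             # uniform time step in 1..T
    eps = rng.standard_normal(y0.shape)          # noise to be predicted
    a = alpha_bar[t - 1]
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps
    # Unsupervised: one input channel (y_t). Supervised: two channels (x, y_t).
    net_in = y_t[None] if x is None else np.stack([x, y_t])
    eps_hat = f_theta(net_in, t)
    return float(np.mean((eps - eps_hat) ** 2))

def zero_predictor(net_in, t):
    """Placeholder for the network: always predicts zero noise."""
    return np.zeros_like(net_in[-1])

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
y0 = rng.uniform(-1.0, 1.0, size=(64, 64))                   # placeholder clean data
x = y0 + 0.1 * rng.standard_normal(y0.shape)                 # placeholder condition

loss_unsupervised = noise_prediction_loss(zero_predictor, y0, alpha_bar, rng)
loss_supervised = noise_prediction_loss(zero_predictor, y0, alpha_bar, rng, x=x)
```

A real training loop would repeat this evaluation over mini-batches and update the network parameters by gradient descent on the loss.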

Inference algorithms for denoising chromosomal contact matrices in HiCDiff
The […] where $\alpha_t := 1-\beta_t$, $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ and $\beta_t$ is a variance schedule parameter at time t used in the forward process. Supplementary Algorithm S1 describes the process of denoising a contact matrix in the unsupervised mode based on the diffusion relationship between […]. Given a pair of (x, $y_0$), where $y_0 \sim q_0(y_0)$ and x is considered to be a corrupted version of $y_0$: $x = Hy_0 + Z$, where the linear degradation matrix $H = I$, $Z \sim \mathcal{N}(0, \sigma_x^2 I)$, and $\sigma_x$ is the noise level.

Figure 1. The forward and inverse processes of HiCDiff for unconditional and conditional diffusion. (A) The unconditional diffusion model (unsupervised mode). In the forward process (q) at the top, the clean data ($y_0$) are gradually corrupted until they become complete noise ($y_T$), and in the reverse process (p) at the bottom the noise is gradually removed until the clean data are recovered. The generative neural network model for removing noise is trained in the unsupervised mode. (B) The conditional diffusion model (supervised mode). The forward and inverse processes of the conditional diffusion are the same as those of the unconditional diffusion except that a condition x, which is provided as an extra input, is concatenated with $y_t$ at every time step t to train the generative neural network model to predict the noise to be removed.

Input: $y_0$, the noiseless chromosomal contact matrix at the start time $t_0$; x, a noisy chromosomal contact matrix provided as a condition

The deep learning architectures of generating cleaner data from noisier data in HiCDiff
We tested two different deep learning architectures to predict the noise that needs to be removed from the noisy data ($y_t$) to obtain the mean of cleaner data ($y_{t-1}$) with or without an input condition (x). One is the standard U-Net with multi-head attention layers used in the classic DDPM models [20]. This method is the baseline diffusion model, which is referred to as DDPM when compared with the other methods. The other is a residual network illustrated in Figure 2, which performs better than the U-Net and is used as the final generative model in HiCDiff. The residual network consists of 32 customized residual blocks with one 3 × 3 convolution layer preceding them and another 3 × 3 convolution layer following them. This design has demonstrated outstanding performance on image super-resolution [31], which inspired us to apply it to Hi-C contact matrices, and it is the same as the generator component used in a generative adversarial network method, ScHiCEDRN [17], for denoising Hi-C data. HiCDiff has 37 583 873 parameters, more than the U-Net-based DDPM (35 704 705 parameters) and ScHiCEDRN (24 770 300 parameters). During training, HiCDiff takes longer than ScHiCEDRN to converge. However, as noted in [32], it is more challenging to optimize the objective function balancing the losses of the generator and discriminator in the generative adversarial network (GAN) model of ScHiCEDRN than to minimize the squared error between the predicted noise and the added noise in HiCDiff during training. So, the difficulty of training the two methods is largely comparable.
Both the unsupervised (Figure 2A) and supervised (Figure 2B) HiCDiff use the same residual network architecture during training and inference, except that the number of input channels of the first convolution layer of the supervised HiCDiff is twice that of the unsupervised HiCDiff, because the former takes an extra condition matrix x as input in addition to the noisy data matrix $y_t$ used by the latter. Once the unsupervised and supervised models ($f_\theta$) are trained by Algorithms 2 and 3, respectively, they can be used by Algorithms S1 and S2, respectively, to generate the cleaner data ($y_{t-1}$) iteratively in the reverse process of diffusion.

Datasets for training, validating and testing HiCDiff
Hi-C data of two different cell lines [human frontal cortex and Drosophila male Dm-BG3c2 (BG3) cells], including a single-cell Hi-C dataset and the corresponding bulk Hi-C dataset for each, were downloaded from the Gene Expression Omnibus (GEO) database [33,34]. The single-cell Hi-C dataset and bulk Hi-C dataset of the human cell line (GEO no.: GSE130711) contain the data of 24 chromosomes (Chr. 1-22, X and Y) [35]. The single-cell Hi-C dataset and corresponding bulk Hi-C dataset of the Drosophila cell line (GEO no.: GSE131811) contain the data of seven chromosomes (chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrM) [36]. Since both the human single-cell dataset and the Drosophila single-cell dataset contain the Hi-C data of many individual cells, the data of three randomly chosen individual cells from the human cell line and two randomly chosen individual cells from the Drosophila cell line were used to train and test the denoising methods.
The 40 kilobase (kb) resolution, which can be well distinguished by the sparse single-cell Hi-C data, was used to produce chromosomal contact matrices for both the single-cell Hi-C data and corresponding bulk Hi-C data. For the human cell line, the single-cell Hi-C data of one cell and the bulk Hi-C data for Chr. 1, 4, 6, 7, 9, 10, 13, 14, 15, 16, 17, 19, 20 and 22 are used as single-cell and bulk-cell training data, respectively (called human_cell_1_training_data and human_population_training_data, respectively), those for Chr. 3, 5, 11 and 21 are used as the single-cell and bulk-cell validation data (called human_cell_1_validation_data and human_population_validation_data, respectively), and those for Chr. 2, 8, 12 and 18 are used as the single-cell and bulk-cell test datasets (called human_cell_1_test_data and human_population_test_data, respectively) to evaluate if the model can generalize from one chromosome to another. Moreover, the single-cell data of another two randomly selected human cells are used as an additional single-cell test dataset (called human_cells_2_3_test_data). The single-cell Hi-C data of two randomly selected Drosophila cells and the corresponding bulk Hi-C data for seven chromosomes (chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrM) are used as the single-cell and bulk-cell test datasets, respectively (called drosophila_cells_test_data and drosophila_population_test_data, respectively) to evaluate if the diffusion model trained on the data of the human cell can generalize to Drosophila.
The original Hi-C chromosomal contact matrices in all the datasets downloaded above were first normalized into the range [0, 1] according to [13] and then converted into the range [−1, 1] by the formula $y_{out} = 2y_{input} - 1$, because the diffusion process requires input values to be in the range [−1, 1] [20]. The normalized original data were treated as clean data (i.e. the ground truth labels). The normalized original chromosomal contact matrices were preprocessed by adding Gaussian noise [12,22] to them by the formula $x = y_{out} + Z$, where $Z \sim \mathcal{N}(0, \sigma_x^2 I)$, to generate noisy input data for denoising. Gaussian noise is chosen here because it has a flat power spectral density, i.e. equal power at all frequencies within its bandwidth, and because it has been used to corrupt Hi-C data in the HiCSR model [12] and images in the DDRM model [22]. $\sigma_x$ is the noise level in the range [0, 1]. The higher the noise level, the more noise is added, i.e. 0 means no Gaussian noise and 1 means full Gaussian noise. Two different noise levels (0.1 and 0.5) were used to generate noisy input data for the denoising tasks. For each noise level (0.1 or 0.5), there are 14 noisy chromosomal contact matrices in the single-cell or bulk training dataset (human_cell_1_training_data and human_population_training_data), 4 noisy chromosomal contact matrices in the single-cell or bulk validation dataset (human_cell_1_validation_data and human_population_validation_data), 4 noisy chromosomal contact matrices of Chr. 2, 8, 12 and 18 in each of the two human single-cell test datasets (i.e. human_cell_1_test_data and human_cells_2_3_test_data), 7 noisy chromosomal contact matrices of chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrM in the Drosophila single-cell test dataset (i.e. drosophila_cells_test_data), 4 noisy chromosomal contact matrices of Chr. 2, 8, 12 and 18 in the human bulk Hi-C test dataset (i.e. human_population_test_data), and 7 noisy chromosomal contact matrices of chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrM in the Drosophila bulk Hi-C test dataset (drosophila_population_test_data).
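The rescaling and corruption steps described above can be sketched as follows. This is an illustrative sketch only: the [0, 1] normalization of [13] is not reproduced, so the input is a random placeholder that is assumed to be already normalized, and the function names `to_diffusion_range` and `add_gaussian_noise` are hypothetical.

```python
import numpy as np

def to_diffusion_range(y01):
    """Map values from [0, 1] to [-1, 1] via y_out = 2 * y_input - 1."""
    return 2.0 * y01 - 1.0

def add_gaussian_noise(y, sigma_x, rng):
    """Corrupt y with i.i.d. Gaussian noise: x = y + Z, Z ~ N(0, sigma_x^2 I)."""
    return y + sigma_x * rng.standard_normal(y.shape)

rng = np.random.default_rng(0)
contacts01 = rng.uniform(0.0, 1.0, size=(64, 64))  # placeholder, assumed in [0, 1]
y_out = to_diffusion_range(contacts01)              # clean data in [-1, 1]
x_low = add_gaussian_noise(y_out, 0.1, rng)         # noise level 0.1
x_high = add_gaussian_noise(y_out, 0.5, rng)        # noise level 0.5
```

Note that the noisy matrices are not clipped here, so individual entries of x may fall slightly outside [−1, 1]; whether the original pipeline clips them is not stated.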
Furthermore, 64 × 64 sub-matrices were cropped from the noisy and clean chromosomal matrices in the training and validation datasets to create the inputs and labels to train and validate the methods. Training HiCDiff and DDPM was quite different from training the traditional non-diffusion deep learning methods. The non-diffusion methods were trained to predict clean data from noisy input data in the supervised mode. In contrast, the diffusion models (HiCDiff and DDPM) can be trained in both the unsupervised and supervised modes. In the unsupervised mode, they were fed with the clean data only, Gaussian noise was gradually added to the data in the forward diffusion process (Algorithm 1) and they were trained to recover the clean data iteratively in the reverse process (Algorithm 2). In the supervised mode, they were fed with both the clean data (labels) and down-sampled noisy data (i.e. condition) as input and were trained on them to learn to recover the clean data (Algorithm 3). During training, the validation datasets were used to inspect the convergence of the training process and select the best trained model for testing.
After training, all the methods were blindly tested on the same test datasets to compare their performance. During testing, a test matrix was divided into 64 × 64 sub-matrices, with zero padding if necessary, for a pretrained method to generate denoised sub-matrices. The denoised sub-matrices were then assembled into the full denoised matrix, which was compared with the ground truth (clean) matrix to evaluate its quality.
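The divide-and-assemble step at test time can be sketched as below. This is a minimal sketch under assumptions: it tiles the whole matrix without overlap, whereas the text does not say whether all tiles or only those near the diagonal are processed, and the function names `split_into_tiles` and `assemble_tiles` are hypothetical.

```python
import numpy as np

def split_into_tiles(mat, size=64):
    """Zero-pad mat to a multiple of `size` and cut it into size x size tiles."""
    n_rows, n_cols = mat.shape
    pad_r = (-n_rows) % size
    pad_c = (-n_cols) % size
    padded = np.pad(mat, ((0, pad_r), (0, pad_c)))  # constant zero padding
    tiles = {}
    for i in range(0, padded.shape[0], size):
        for j in range(0, padded.shape[1], size):
            tiles[(i, j)] = padded[i:i + size, j:j + size]
    return tiles, mat.shape

def assemble_tiles(tiles, original_shape, size=64):
    """Place tiles back at their offsets and crop away the padding."""
    n_rows, n_cols = original_shape
    padded_rows = -(-n_rows // size) * size   # ceiling to a multiple of size
    padded_cols = -(-n_cols // size) * size
    full = np.zeros((padded_rows, padded_cols))
    for (i, j), tile in tiles.items():
        full[i:i + size, j:j + size] = tile
    return full[:n_rows, :n_cols]

mat = np.arange(150 * 150, dtype=float).reshape(150, 150)  # placeholder matrix
tiles, shape = split_into_tiles(mat)
# A denoiser would be applied to each tile here; the identity is used instead.
restored = assemble_tiles(tiles, shape)
```

Keying each tile by its top-left offset keeps reassembly independent of the order in which tiles are denoised.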

Evaluation metrics
HiCDiff and DDPM diffusion models trained in the unsupervised mode (HiCDiff1 and DDPM1) and in the supervised mode (HiCDiff2 and DDPM2) were compared with several non-diffusion deep learning methods including DeepHiC [13], HiCSR [12], ScHiCEDRN [17], HiCPlus [14] and Loopenhance (DeepLoop) [19] in terms of multiple image-based evaluation metrics (PSNR: peak signal-to-noise ratio; SSIM: Structural Similarity Index Measure; MSE: mean squared error; SNR: signal-to-noise ratio) and a Hi-C reproducibility metric, the HiCRep score [37], which measures the similarity between denoised chromosomal contact matrices and clean matrices. To compare the denoised data generated by different methods against the clean data (target) after training, they were renormalized back into the range [0, 1] according to [22] by the formula $y_{out} = (y_{input} + 1)/2$, $-1 \le y_{input} \le 1$, before they were compared with the clean data, which were also renormalized back into the range [0, 1], because some image-based metrics such as SSIM are usually applied to non-negative matrices [22] and HiCRep [37] requires the two contact matrices being compared to have non-negative values.

Table 1. The results of the four unsupervised and supervised HiCDiff and DDPM diffusion models as well as five supervised non-diffusion deep learning methods on the human_cell_1_test_data at two input noise levels (0.1 and 0.5). The results of the input data without denoising are also shown as the baseline. Bold fonts with '*' and '**' denote the best and second-best results, respectively. The unsupervised HiCDiff1 and DDPM1 used the linear noise variance schedule in their forward process, while the supervised HiCDiff2 and DDPM2 used the sigmoid noise variance schedule.
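The renormalization to [0, 1] and two of the image-based metrics named in this section can be sketched as follows. This is an illustrative sketch under assumptions: values are clipped to [−1, 1] before rescaling (consistent with the stated constraint on $y_{input}$, but the clipping itself is an assumption), PSNR uses a data range of 1, and the exact SSIM, SNR and HiCRep definitions used in the paper are not reproduced. The function names are hypothetical.

```python
import numpy as np

def to_unit_range(y):
    """Map values from [-1, 1] to [0, 1] via (y + 1) / 2, clipping first."""
    return (np.clip(y, -1.0, 1.0) + 1.0) / 2.0

def mse(pred, target):
    """Mean squared error between two matrices."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for data with the given value range."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

rng = np.random.default_rng(0)
clean = to_unit_range(rng.uniform(-1.0, 1.0, size=(64, 64)))  # placeholder target
denoised = clean + 0.01 * rng.standard_normal(clean.shape)    # placeholder output
scores = {"MSE": mse(denoised, clean), "PSNR": psnr(denoised, clean)}
```

Both metrics are computed on the assembled full matrices after renormalization, so that all methods are compared on the same non-negative scale.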

Diffusion models substantially improve the quality of single-cell Hi-C data
The quality of the chromosomal contact matrices denoised by the unsupervised and supervised diffusion models (HiCDiff1, DDPM1, HiCDiff2 and DDPM2) and that of the input matrices in the human_cell_1_test_data at the two noise levels are first compared in Table 1. All four diffusion models substantially improve the quality of the input data according to the four image-based evaluation metrics. For instance, at the noise level of 0.1, the unsupervised HiCDiff1 increases PSNR from 28.9701 to 42.4171, SSIM from 0.1885 to 0.9662 and SNR from 27 289 to 119 604, and reduces MSE from 0.0013 to 0.00006; at the higher noise level of 0.5, the improvement is generally more pronounced. Among the four diffusion models, the unsupervised diffusion models (HiCDiff1 and DDPM1) work better than their supervised counterparts (HiCDiff2 and DDPM2), respectively, indicating that the unsupervised (unconditional) mode is generally more effective than the supervised (conditional) mode for denoising the single-cell Hi-C data. Moreover, HiCDiff1 (or HiCDiff2) generally performs better than DDPM1 (or DDPM2), showing that the residual network architecture used by HiCDiff is more effective than the U-Net used by DDPM. Among the four diffusion models, the unsupervised HiCDiff (HiCDiff1) performs best on this dataset. Similar results are observed on the single-cell Hi-C data of human cells 2 and 3 in human_cells_2_3_test_data and the two Drosophila cells in drosophila_cells_test_data (supplemental Tables S1 and S2). Because human cells 2 and 3 as well as the Drosophila cells were not used in training at all, the results show that the diffusion models trained on one cell can work well on other cells of both the same species and a different species and therefore are highly generalizable.
We also compare the four diffusion models, trained on the bulk Hi-C data, on the bulk Hi-C test datasets (human_population_test_data and drosophila_population_test_data) (supplemental Tables S3 and S4). At the high noise level of 0.5, the unsupervised HiCDiff1 still performs best on both datasets. At the low noise level of 0.1, the supervised HiCDiff2 works best on human_population_test_data, while the unsupervised HiCDiff1 works best on drosophila_population_test_data. Considering all the situations together, the unsupervised HiCDiff1 still performs best on the population Hi-C data. It is worth noting that the improvement of the best diffusion model over the input data on the bulk Hi-C data (supplemental Tables S3 and S4) is generally smaller than on the single-cell Hi-C data (supplemental Tables S1 and S2), probably because the single-cell Hi-C data are much sparser than the bulk Hi-C data and therefore have more room for improvement.
To test the robustness of each model against different resolutions of Hi-C data, we evaluated all the models' performance on Drosophila single-cell Hi-C data at two additional resolutions: 20 kb and 10 kb. The results are shown in supplemental Table S5. HiCDiff1 still works very well at these higher resolutions and outperforms ScHiCEDRN in both cases.
Because chromosomes 4 and M (mitochondrial) of Drosophila are very short and its chromosome X only has one copy, we performed an experiment to test whether the three chromosomes had a big effect on the results. The results of the methods on the Hi-C data with and without the three chromosomes are shown in Table S6. The results show that, after excluding the three chromosomes, HiCDiff1's relative performance is further improved, and it outperforms all other methods including ScHiCEDRN.
To further validate all the methods on a Hi-C dataset whose true chromosomal contacts are known, we use as ground truth a clean Hi-C chromosomal contact dataset of yeast that was rigorously curated by its authors at a false discovery rate of 1% [38]. The dataset is corrupted at the noise level of 0.1 as input for the methods to remove noise. The results are shown in supplementary Table S7. The unsupervised HiCDiff1 outperforms all the methods but the supervised ScHiCEDRN, and its performance is comparable to ScHiCEDRN.
Finally, it is worth pointing out that in the experiments above, the Gaussian noise added into the Hi-C contact matrices is not symmetric (i.e. noise[i,j] != noise[j,i], where i and j are the indices of two chromosomal bins). To test if the symmetry of the noise affects the performance of the methods, we compare the supervised denoising methods on the single-cell Hi-C data of Drosophila cells 1 and 2 in drosophila_cells_test_data corrupted with symmetric and non-symmetric noise at the noise level of 0.1, respectively. Because the symmetric noise does not strictly follow the Gaussian distribution, the predefined linear degradation matrix H in the unsupervised diffusion model cannot be theoretically derived, and the unsupervised diffusion methods are not included in this comparison. The results in supplemental Table S8 show that the supervised diffusion models can substantially improve the quality of the data with either symmetric or asymmetric noise. The performance of all the supervised methods on denoising the data with the symmetric noise is slightly better than on the data with the asymmetric noise. In both situations, the supervised HiCDiff2 outperforms all the other supervised methods but ScHiCEDRN. Overall, the symmetry of noise has only a minor positive impact on the performance of the methods.
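The difference between the two kinds of noise can be illustrated with a short sketch. The way the symmetric noise was constructed is not specified in the text; the version below, which mirrors the upper triangle of a Gaussian matrix onto the lower triangle, is one plausible construction and is an assumption, as are the function names.

```python
import numpy as np

def asymmetric_noise(n, sigma, rng):
    """Independent Gaussian noise in every entry, so noise[i, j] != noise[j, i]."""
    return sigma * rng.standard_normal((n, n))

def symmetric_noise(n, sigma, rng):
    """Assumed construction: mirror the upper triangle onto the lower triangle."""
    z = sigma * rng.standard_normal((n, n))
    upper = np.triu(z)                    # diagonal and above
    return upper + np.triu(z, k=1).T      # reflect the strict upper triangle

rng = np.random.default_rng(0)
z_asym = asymmetric_noise(64, 0.1, rng)
z_sym = symmetric_noise(64, 0.1, rng)
```

Under this construction each off-diagonal entry of the symmetric noise is still Gaussian, but mirrored pairs are perfectly correlated, so the full matrix no longer has the independent entries implied by $Z \sim \mathcal{N}(0, \sigma_x^2 I)$.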

Unsupervised HiCDiff outperforms most supervised methods and is comparable to the best supervised method for denoising Hi-C data in terms of image-based metrics
The results of the diffusion models are also compared with the other five supervised non-diffusion deep learning methods (DeepHiC, HiCSR, HiCPlus, Loopenhance and ScHiCEDRN) on the single-cell Hi-C data of human cell 1 (human_cell_1_test_data) in terms of the image-based evaluation metrics (Table 1) for two input noise levels: 0.1 and 0.5. The unsupervised HiCDiff (HiCDiff1) substantially outperforms all the supervised non-diffusion methods (DeepHiC, Loopenhance, HiCPlus and HiCSR) except ScHiCEDRN, while its performance is close to that of ScHiCEDRN. Similar results are observed on the two other human cells in human_cells_2_3_test_data (supplemental Table S1) as well as on the two Drosophila cells in drosophila_cells_test_data (supplemental Table S2). One unique feature of an unsupervised diffusion model like HiCDiff1 is that it only needs the clean data for training, unlike the supervised methods, which require pairs of clean and noisy data for training.
Figure 3 visually compares the chromosome matrices denoised by all the methods for the region (102.40-104.96 Mb) of Chr. 2 (Figure 3A) and the region (307.20-309.76 Mb) of Chr. 12 (Figure 3B). The sub-matrix with the highest contrast among these methods is highlighted by a green square in the first row of the matrices in Figure 3A or Figure 3B. The highlighted sub-matrix is enlarged in the second row in each sub-figure, which shows that the data denoised by the diffusion models (HiCDiff and DDPM) are more similar to the target (clean) matrix than those denoised by DeepHiC, HiCSR, HiCPlus and Loopenhance, while HiCDiff1 performs comparably to ScHiCEDRN.
In addition to comparing the diffusion methods with the supervised non-diffusion methods on the single-cell Hi-C data, we also trained them on the human bulk Hi-C training data and then tested them on the two bulk Hi-C test datasets (human_population_test_data and drosophila_population_test_data) (see results in Tables S3 and S4). Largely similar results are seen on the bulk Hi-C data as on the single-cell Hi-C data. The unsupervised or supervised HiCDiff generally performs better than the other supervised non-diffusion methods (DeepHiC, HiCPlus, Loopenhance and HiCSR) but a little worse than ScHiCEDRN. The results demonstrate that the diffusion models are also effective in denoising bulk Hi-C data.

HiCDiff achieves the state-of-the-art performance in terms of Hi-C reproducibility metric
HiCRep [37] is a metric for systematically assessing the reproducibility of Hi-C data that captures spatial features, such as distance dependence or domain structure, that are often neglected by other evaluation metrics. The higher the HiCRep score, the better the denoised Hi-C data are. In terms of this metric, we evaluated all the methods on the data of the test chromosomes of the same human cell (human_cell_1_test_data) (Figure 4), the data of the two unseen human cells in human_cells_2_3_test_data (Figure 5) and the two unseen Drosophila cells in drosophila_cells_test_data (Figure 5).
On the test data of the same human cell 1 at the lower input noise level of 0.1 (Figure 4A), the performance of the unsupervised HiCDiff (HiCDiff1) is second to a supervised method (ScHiCEDRN) and much better than the other four supervised non-diffusion methods (DeepHiC, HiCPlus, Loopenhance and HiCSR); at the higher input noise level of 0.5 (Figure 4B), the supervised HiCDiff (HiCDiff2) performs much better than ScHiCEDRN and the other non-diffusion supervised methods.
On human_cells_2_3_test_data, at the lower input noise level of 0.1 (Figure 5A), ScHiCEDRN performs best, HiCDiff1 is second best and the diffusion methods (HiCDiff1, HiCDiff2, DDPM1 and DDPM2) outperform the other four supervised non-diffusion methods (DeepHiC, HiCPlus, Loopenhance and HiCSR); at the higher input noise level of 0.5 (Figure 5B), the supervised HiCDiff2 performs best among all the methods.
On drosophila_cells_test_data, at the two noise levels (0.1 and 0.5) (Figure 5C and Figure 5D), the four diffusion models (HiCDiff1, DDPM1, HiCDiff2 and DDPM2) and ScHiCEDRN have a similar performance, which is much better than that of the other four supervised non-diffusion methods.
We also evaluated all the methods on the human bulk Hi-C data in terms of HiCRep scores (Figure S1). The four unsupervised/supervised diffusion models (HiCDiff1, DDPM1, HiCDiff2 and DDPM2) and two supervised deep learning methods (ScHiCEDRN and HiCSR) perform similarly at the two input noise levels, and they substantially outperform the other three non-diffusion methods (DeepHiC, HiCPlus and Loopenhance).
Considering all the results above together, all the methods generally have higher HiCRep scores on the bulk Hi-C data than on the single-cell Hi-C data, which may be due to the high sparsity of the single-cell Hi-C data. On the single-cell Hi-C data, at the higher input noise level (0.5), the supervised HiCDiff2 performs best and ScHiCEDRN second, while at the lower input noise level (0.1), ScHiCEDRN has the best performance and the unsupervised HiCDiff1 the second best.
Following the method in [17], we evaluated how well the Hi-C data denoised by each method could be used to identify an important chromosome feature, TADs, on different cells and cell lines. The difference between the TAD insulation score vector of each contact matrix denoised by each method and the insulation score vector of the original clean contact matrix is measured by the L2 norm dissimilarity score, as shown in supplemental Figure S2. The smaller the L2 norm dissimilarity score, the better a method can denoise a matrix for TAD identification. The results show that HiCDiff2 performs second best, next only to ScHiCEDRN.
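The TAD-based comparison can be sketched as follows. This is a simplified illustration, not the procedure of [17]: the insulation score here is the mean contact frequency in a square window straddling each bin, the window size of 10 bins is an assumed value, and the function names `insulation_scores` and `l2_dissimilarity` are hypothetical.

```python
import numpy as np

def insulation_scores(mat, window=10):
    """Mean contact frequency between the `window` bins on either side of each bin."""
    n = mat.shape[0]
    scores = np.full(n, np.nan)           # edges without a full window stay NaN
    for i in range(window, n - window):
        # Rows: bins upstream of i; columns: bins downstream of i.
        scores[i] = mat[i - window:i, i + 1:i + window + 1].mean()
    return scores

def l2_dissimilarity(denoised, clean, window=10):
    """L2 norm of the difference between two insulation score vectors."""
    a = insulation_scores(denoised, window)
    b = insulation_scores(clean, window)
    valid = ~np.isnan(a) & ~np.isnan(b)
    return float(np.linalg.norm(a[valid] - b[valid]))

# Toy matrix with two domains of 30 bins each and no contacts between them.
clean = np.zeros((60, 60))
clean[:30, :30] = 1.0
clean[30:, 30:] = 1.0
profile = insulation_scores(clean, window=5)

rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
score = l2_dissimilarity(noisy, clean, window=5)
```

In the toy example the insulation profile drops to zero at the domain boundary and stays at one inside a domain, which is the signature used to call TAD boundaries.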

The impact of the noise variance schedule on the performance of HiCDiff
The noise variance schedule (β_t, 1 ≤ t ≤ T) in the diffusion model, which controls how Gaussian noise is gradually added to the data in the forward diffusion process, can influence its performance. We conducted an ablation study to investigate its impact on HiCDiff's performance. The results of two different variance schedules (linear and sigmoid) on human_cell_1_test_data at the input noise level of 0.1 are reported in Table 2. In the unsupervised mode, the linear variance schedule performs clearly better than the sigmoid variance schedule, while in the supervised mode, the sigmoid variance schedule performs slightly better than the linear variance schedule in terms of PSNR and SSIM, equally in terms of MSE and slightly worse in terms of SNR. Overall, HiCDiff with the linear schedule in the unsupervised mode works best on the single-cell Hi-C dataset.
The results of the different variance schedules on the bulk Hi-C data (human_population_test_data) at the input noise level of 0.1 are reported in Table S9. Similarly, HiCDiff with the linear variance schedule performs better than with the sigmoid variance schedule in the unsupervised mode, while the opposite holds in the supervised mode.
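To make the two schedules concrete, the following is a minimal sketch of how a linear and a sigmoid variance schedule could be generated. The endpoint values (1e-4 and 0.02), the number of steps and the `span` parameter are assumptions borrowed from common DDPM practice, not values reported for HiCDiff, and the function names are hypothetical.

```python
import math

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Betas that rise in equal steps from beta_start to beta_end."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * t for t in range(T)]

def sigmoid_beta_schedule(T, beta_start=1e-4, beta_end=0.02, span=6.0):
    """Betas that follow a sigmoid curve between beta_start and beta_end.

    The sigmoid is evaluated on [-span, span], so the betas grow slowly at
    the start and end of the process and fastest in the middle.
    """
    betas = []
    for t in range(T):
        x = -span + 2.0 * span * t / (T - 1) if T > 1 else 0.0
        s = 1.0 / (1.0 + math.exp(-x))
        betas.append(beta_start + (beta_end - beta_start) * s)
    return betas

def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the share of signal variance left at step t."""
    out = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000  # assumed number of diffusion steps
linear_betas = linear_beta_schedule(T)
sigmoid_betas = sigmoid_beta_schedule(T)

# Compare how much signal survives halfway through the forward process.
print(cumulative_alpha(linear_betas)[T // 2])
print(cumulative_alpha(sigmoid_betas)[T // 2])
```

In the standard DDPM formulation, the noised sample at step t is the clean sample scaled by the square root of this cumulative product plus Gaussian noise scaled by the square root of its complement, so the two schedules differ in how quickly the signal is destroyed along the forward process.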

Conclusion
We developed the first diffusion models (HiCDiff and DDPM) to denoise single-cell Hi-C data in both unsupervised and supervised learning modes. The diffusion models can substantially improve the quality of noisy Hi-C data. HiCDiff, based on the residual network, generally performs better than DDPM, based on the standard U-Net, in terms of multiple metrics. On multiple test datasets at two input noise levels from the same cell, different single cells and cells of different species, the unsupervised HiCDiff achieved performance similar to the state-of-the-art supervised ScHiCEDRN and better than all other non-diffusion supervised methods, demonstrating that diffusion models are a useful approach to the single-cell Hi-C data denoising problem. Its performance generalizes well from one cell to another and from one species to another. Moreover, HiCDiff performs well in denoising bulk Hi-C data.

Key Points
• A novel generative artificial intelligence (AI) method based on diffusion models (HiCDiff) was developed to denoise single-cell Hi-C data.
• HiCDiff can effectively remove noise in single-cell Hi-C data in both unsupervised and supervised modes.
• The unsupervised HiCDiff achieves performance on par with the state-of-the-art supervised deep learning methods in the field.
• HiCDiff can also be applied to denoise bulk Hi-C data.

number: GSE130711) and population human Hi-C data (GEO accession number: GSE130711) were downloaded from https://salkinstitute.app.box.com/s/fp63a4j36m5k255dhje3zcj5kfuzkyj1. The Cooler file containing both the single-cell Hi-C data of two Drosophila cells (GEO accession number: GSE131811) and population Drosophila Hi-C data (GEO accession number: GSE131811) was obtained from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131811. All the original and processed datasets can also be downloaded from Zenodo: https://doi.org/10.5281/zenodo.10223407.

Algorithm 3. Training a generative neural network model in the supervised mode.
7: Until converged
8: Return f_θ

the original and spectral spaces, and supplementary Algorithm S2 describes how to denoise a contact matrix in the supervised mode.

Figure 2. The architecture of the deep residual network used by HiCDiff to predict the noise (ε̂) to be removed from a corrupted input matrix (y_t) at time t. (A) The architecture of the unsupervised learning mode. It has one convolutional layer (filter size: 3 × 3 × C; C: the number of input channels), followed by 32 residual blocks (Resblock) and a final convolutional layer to predict the noise. (B) The architecture of the supervised learning mode. X is the conditional source matrix provided by users. (C) The layers of the residual block (Resblock).

Figure 3. The heatmaps visualize the contact matrices denoised by the 9 denoising methods for Chr. 2 and Chr. 12 in human_cell_1_test_data. (A) The matrices of region 102.40-104.96 Mb of Chr. 2. The noisy input matrix and the clean target matrix are visualized at the beginning and the end, respectively. (B) The matrices for the region 307.20-309.76 Mb of Chr. 12. In the first row in (A) and (B), the green squares highlight the sub-regions (102.72-103.20 Mb) of Chr. 2 and (307.52-308.00 Mb) of Chr. 12 with a more pronounced difference between the matrices, which are enlarged in the second row.

Figure 4. The box plot of the average HiCRep scores on human_cell_1_test_data at two different input noise levels (0.1 and 0.5). (A) Input noise level 0.1, (B) input noise level 0.5.

Figure 5. The box plot of the average HiCRep scores on human cells 2 and 3 of human_cells_2_3_test_data and Drosophila cells 1 and 2 of drosophila_cells_test_data at two noise levels. (A) On human_cells_2_3_test_data at input noise level 0.1, (B) on human_cells_2_3_test_data at input noise level 0.5, (C) on drosophila_cells_test_data at input noise level 0.1 and (D) on drosophila_cells_test_data at input noise level 0.5.

Table 2. The performance of HiCDiff with the linear and sigmoid noise variance schedules on human_cell_1_test_data at the input noise level of 0.1. '*' and '**' denote the best and second-best results, respectively.